Abstract: To address the limitations of 3D convolutional neural networks and two-stream convolutional neural networks for human activity recognition in video, a composite deep neural network combining the two-stream and 3D convolutional architectures is proposed. An improved residual (2+1)D convolutional neural network is employed in both the temporal and spatial sub-networks of the two-stream architecture, so that action representations and classifiers are learned from the RGB frames and the optical flow of the video, respectively, and the classification results of the temporal and spatial streams are then fused. Furthermore, during network training, stochastic gradient descent with momentum is improved by the gradient centralization algorithm to enhance generalization without changing the network structure. Experimental results show that the proposed network achieves higher accuracy on the UCF101 and HMDB51 datasets.
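To make the architectural choice concrete, the sketch below shows the (2+1)D factorization on which the residual (2+1)D network is built: a full 3D convolution is replaced by a spatial 1 x d x d convolution followed by a temporal t x 1 x 1 convolution. This is a minimal PyTorch illustration of the general technique, not the paper's exact block; the class name `Conv2Plus1D` and all channel sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn


class Conv2Plus1D(nn.Module):
    """Minimal sketch of a (2+1)D convolution: a spatial 1 x d x d
    convolution followed by a temporal t x 1 x 1 convolution, with an
    intermediate width `mid_channels` chosen so the parameter count
    roughly matches the full 3D kernel it replaces (an assumption here)."""

    def __init__(self, in_channels, out_channels, mid_channels,
                 spatial_kernel=3, temporal_kernel=3, stride=(1, 1, 1)):
        super().__init__()
        self.spatial = nn.Conv3d(
            in_channels, mid_channels,
            kernel_size=(1, spatial_kernel, spatial_kernel),
            stride=(1, stride[1], stride[2]),
            padding=(0, spatial_kernel // 2, spatial_kernel // 2),
            bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(
            mid_channels, out_channels,
            kernel_size=(temporal_kernel, 1, 1),
            stride=(stride[0], 1, 1),
            padding=(temporal_kernel // 2, 0, 0),
            bias=False)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.bn(self.spatial(x))))


# Illustrative usage on a clip of 8 RGB frames at 112 x 112 resolution.
x = torch.randn(2, 3, 8, 112, 112)
block = Conv2Plus1D(in_channels=3, out_channels=64, mid_channels=45)
print(block(x).shape)  # torch.Size([2, 64, 8, 112, 112])
```

The extra nonlinearity between the spatial and temporal convolutions is one reason the (2+1)D decomposition can outperform a single 3D kernel of the same nominal size.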
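The training improvement mentioned in the abstract, gradient centralization applied to SGD with momentum, can also be sketched briefly: each multi-dimensional weight gradient is zero-meaned over all dimensions except the output-channel axis before the usual momentum update. The subclass below is a minimal, hedged illustration of that idea in PyTorch; the class name `GCSGD` and the hyperparameters in the usage comment are assumptions, not the paper's exact settings.

```python
import torch
from torch.optim import SGD


class GCSGD(SGD):
    """SGD with momentum where each weight gradient is centralized
    (zero-meaned over all dimensions except the output-channel axis)
    before the standard update, in the spirit of gradient centralization."""

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # Centralize only multi-dimensional weights (conv / fc),
                # leaving biases and BatchNorm parameters untouched.
                if p.grad.dim() > 1:
                    dims = tuple(range(1, p.grad.dim()))
                    p.grad.sub_(p.grad.mean(dim=dims, keepdim=True))
        return super().step(closure)


# Illustrative usage; learning rate, momentum and weight decay are assumptions.
# optimizer = GCSGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
```

Because the centralization is applied to the gradients only, it can be dropped into an existing training loop without modifying the network itself, which matches the abstract's claim of improving generalization without varying the network structure.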